The rapid growth of mobile communication has resulted in a significant increase in unsolicited and fraudulent spam messages. These messages not only inconvenience users but also pose serious security and privacy risks. Manual filtering of spam messages is inefficient due to the large volume of data generated daily. This paper proposes an intelligent SMS spam detection system using ensemble machine learning techniques. Text preprocessing is applied to clean and normalize SMS content, followed by feature extraction using the Term Frequency–Inverse Document Frequency (TF-IDF) method. Multiple machine learning classifiers, including Naive Bayes, Logistic Regression, and Support Vector Machine, are trained and combined using a voting-based ensemble approach. Experimental results on the SMS Spam Collection dataset demonstrate that the ensemble model outperforms individual classifiers by achieving higher accuracy and reduced false positives. The proposed system provides an effective and reliable solution for automated spam detection.
Introduction
This paper develops an SMS spam detection system that combines several machine learning classifiers in a voting ensemble, with the aim of improving accuracy and reliability over any single model.
Background:
SMS is widely used because it is simple and cheap, but that popularity has brought a rise in spam messages, including advertisements, phishing attempts, and fraud, which can cause financial loss and privacy breaches. Traditional rule-based spam filters are inflexible and struggle to adapt to evolving spam patterns. Machine learning approaches offer more robust detection by learning from historical data.
Related Work:
Early spam detection relied on rule-based filtering, which was manually intensive and limited.
Probabilistic classifiers (e.g., Naive Bayes) improved handling of text data.
Logistic Regression and Support Vector Machines (SVMs) offered strong generalization.
Ensemble learning combines multiple classifiers so that their individual errors partly cancel, which can reduce variance and, for some methods, bias.
Deep learning models exist but require large datasets and high computational resources.
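The error-cancelling argument for ensembles can be made concrete with a simplified calculation. Assume, purely for illustration, three classifiers that each misclassify a message with the same probability p and whose errors are independent. Neither assumption is claimed by this study, and classifiers trained on the same features are usually correlated, so the real gain is smaller. Under these assumptions a majority vote is wrong only when at least two of the three are wrong:

```latex
\[
P_{\text{ens}}
  = \binom{3}{2} p^{2}(1-p) + \binom{3}{3} p^{3}
  = 3p^{2} - 2p^{3}
\]
% Worked example with an assumed individual error rate p = 0.1:
\[
P_{\text{ens}} = 3(0.1)^{2} - 2(0.1)^{3} = 0.030 - 0.002 = 0.028
\]
```

In this idealized setting the vote is wrong less often than any single member whenever p < 0.5. The more the members' errors overlap, the closer the ensemble error moves back toward p.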
Dataset:
The study uses the SMS Spam Collection dataset from Kaggle [6], which contains SMS messages labeled as spam or ham (non-spam). The dataset was cleaned and preprocessed before model training.
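As a minimal sketch of loading and cleaning, the snippet below reads labeled messages and holds out 20% for testing. Several details are assumptions and not taken from the study: the column names `v1` (label) and `v2` (text), the cleaning rules (drop blank messages and exact duplicates), the random seed, and the helper names `load_messages` and `holdout_split`. The inline sample stands in for the real file.

```python
import csv
import io
import random

# Assumed layout: header row with a label column "v1" and a text column "v2".
# The six rows below are made-up stand-ins for the real dataset file.
SAMPLE_CSV = """v1,v2
ham,"Ok, see you at 5"
spam,"WINNER! Claim your free prize now"
ham,Running late
ham,Running late
spam,Free entry in a weekly draw
ham,
"""

def load_messages(handle, label_col="v1", text_col="v2"):
    """Read (label, text) pairs, skipping blank messages and exact duplicates."""
    seen, rows = set(), []
    for record in csv.DictReader(handle):
        label = (record.get(label_col) or "").strip().lower()
        text = (record.get(text_col) or "").strip()
        if label not in {"ham", "spam"} or not text or (label, text) in seen:
            continue
        seen.add((label, text))
        rows.append((label, text))
    return rows

def holdout_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and return (train, test) lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = load_messages(io.StringIO(SAMPLE_CSV))
train, test = holdout_split(rows)
print(len(rows), len(train), len(test))
```

With a class-imbalanced dataset, a stratified split that preserves the spam-to-ham ratio in both parts would usually be preferable to the plain shuffle shown here.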
Methodology:
Text Preprocessing: Convert to lowercase, remove punctuation, special characters, extra spaces, and stopwords.
Feature Extraction: Use TF-IDF to convert text into numerical features.
Model Training: Split the data 80:20 into training and test sets, then train Naive Bayes, Logistic Regression, and SVM individually on the training portion.
Ensemble Learning: Combine predictions using hard voting, where the majority vote determines the final class.
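The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the study's implementation: the stopword list is a tiny made-up sample, the TF-IDF weighting is the unsmoothed textbook form (libraries often use smoothed or normalized variants), and the three prediction lists are hypothetical outputs standing in for trained Naive Bayes, Logistic Regression, and SVM models.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); real systems use a fuller one.
STOPWORDS = {"a", "an", "the", "to", "is", "you", "your", "for", "of", "and", "now"}

def preprocess(message):
    """Lowercase, strip punctuation and special characters, drop stopwords."""
    text = message.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # non-alphanumeric characters -> space
    tokens = text.split()                      # split() also collapses extra spaces
    return [t for t in tokens if t not in STOPWORDS]

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    tf = term count / document length, idf = ln(N / document frequency).
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

def hard_vote(*model_predictions):
    """Majority vote; each argument is one model's list of predicted labels."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*model_predictions)]

messages = ["WIN a FREE prize now!!!", "Are you free for lunch?", "Call to claim your prize"]
docs = [preprocess(m) for m in messages]
vectors = tfidf(docs)

# Hypothetical per-model predictions for the three messages above.
nb_pred = ["spam", "ham", "ham"]
lr_pred = ["spam", "spam", "ham"]
svm_pred = ["ham", "spam", "ham"]
final = hard_vote(nb_pred, lr_pred, svm_pred)
print(docs[0], final)
```

With three voters and two classes a tie cannot occur, so the majority is always well defined. In `vectors[0]`, the term "win" appears in only one message and therefore gets a higher weight than "free", which appears in two.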
Results:
Individual classifiers performed well, with SVM outperforming Naive Bayes and Logistic Regression.
The ensemble model achieved the best overall performance, effectively combining the strengths of all classifiers.
Confusion matrix analysis showed the ensemble approach significantly reduced false positives, ensuring legitimate messages are less likely to be misclassified as spam.
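The false-positive comparison rests on confusion-matrix counts. The sketch below shows one way such counts can be turned into accuracy, precision, recall, and false-positive rate, treating spam as the positive class. The function names and the toy labels are illustrative assumptions, not results from the study.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Return (tp, fp, tn, fn), treating `positive` as the spam class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1   # legitimate message flagged as spam
        else:
            if truth == positive:
                fn += 1   # spam message that slipped through
            else:
                tn += 1
    return tp, fp, tn, fn

def summarize(y_true, y_pred):
    """Compute common metrics, returning 0.0 where a denominator is zero."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }

# Made-up labels for six messages, only to exercise the functions.
y_true = ["ham", "ham", "ham", "spam", "spam", "ham"]
y_pred = ["ham", "spam", "ham", "spam", "ham", "ham"]
metrics = summarize(y_true, y_pred)
print(metrics)
```

For spam filtering the false-positive rate is often watched more closely than raw accuracy, because a legitimate message sent to the spam folder usually costs the user more than a spam message that reaches the inbox.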
Conclusion
This paper presented an intelligent SMS spam detection system using ensemble machine learning techniques. By combining multiple classifiers through a voting-based ensemble approach, the proposed system achieved improved performance compared to individual models. The results demonstrate the effectiveness of ensemble learning in reducing false positives and enhancing spam detection accuracy.
References
[1] T. Almeida et al., “Contributions to the Study of SMS Spam Filtering,” ACM Symposium, 2011.
[2] H. Drucker et al., “Support Vector Machines for Spam Categorization,” IEEE Transactions on Neural Networks, 1999.
[3] J. Ramos, “Using TF-IDF to Determine Word Relevance in Document Queries,” 2003.
[4] C. Cortes and V. Vapnik, “Support-Vector Networks,” Machine Learning, 1995.
[5] L. Breiman, “Random Forests,” Machine Learning, 2001.
[6] Kaggle, “SMS Spam Collection Dataset,” Kaggle Repository.
[7] T. Mitchell, Machine Learning, McGraw-Hill, 1997.